# Cross-modal retrieval
**So400m Long** (fancyfeast) · Apache-2.0 · Text-to-Image · Transformers, English · 27 downloads · 3 likes
A vision-language model fine-tuned from SigLIP 2, with the maximum text length increased from 64 to 256 tokens.

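The longer text tower mainly changes how captions are tokenized. Below is a minimal sketch of text-to-image scoring with the Transformers SigLIP classes; the repository id `fancyfeast/so400m-long` is inferred from this listing and may not be the exact name.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Hypothetical repository id inferred from the listing; substitute the real one.
MODEL_ID = "fancyfeast/so400m-long"

model = AutoModel.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("example.jpg").convert("RGB")
captions = [
    "a long, highly detailed caption that would overflow the usual 64-token limit ...",
    "a short caption",
]

# SigLIP-style models are trained with fixed-length padding; this fine-tune
# accepts captions of up to 256 text tokens instead of 64.
inputs = processor(
    text=captions,
    images=image,
    padding="max_length",
    max_length=256,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**inputs)

# Higher values mean a better image-caption match (SigLIP uses a sigmoid, not a softmax).
print(outputs.logits_per_image)
```
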
**LLM2CLIP Openai L 14 224** (microsoft) · Apache-2.0 · Text-to-Image · 108 downloads · 5 likes
LLM2CLIP is an innovative approach that leverages large language models (LLMs) to unlock the potential of CLIP. It enhances text discriminability through a contrastive learning framework, overcoming the limitations of the original CLIP text encoder.

**LLM2CLIP Openai B 16** (microsoft) · Apache-2.0 · Text-to-Image · Safetensors · 1,154 downloads · 18 likes
LLM2CLIP is an innovative method that leverages large language models (LLMs) to extend CLIP's capabilities, enhancing text discriminability through a contrastive learning framework and significantly improving cross-modal task performance.

**LLM2CLIP EVA02 L 14 336** (microsoft) · Apache-2.0 · Text-to-Image · PyTorch · 75 downloads · 60 likes
LLM2CLIP is an innovative approach that enhances CLIP's visual representation capabilities through large language models (LLMs), significantly improving cross-modal task performance.

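Like the other dual-encoder models in this list, the LLM2CLIP checkpoints embed images and texts separately, and retrieval reduces to a cosine-similarity ranking over L2-normalized embeddings. The sketch below shows only that generic ranking step; it assumes you have already produced the embeddings with an encoder of your choice and is not the LLM2CLIP-specific loading code from the model cards.

```python
import numpy as np

def rank_images_by_text(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Return image indices sorted from best to worst match for one text query.

    text_emb:   (d,)   embedding of the query caption
    image_embs: (n, d) embeddings of the candidate images
    """
    # L2-normalize so the dot product equals cosine similarity.
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)

    similarities = image_embs @ text_emb  # (n,)
    return np.argsort(-similarities)      # highest similarity first

# Toy usage with random vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
print(rank_images_by_text(rng.normal(size=512), rng.normal(size=(5, 512))))
```
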
**Safeclip Vit L 14** (aimagelab) · Text-to-Image · Transformers · 931 downloads · 3 likes
Safe-CLIP is an enhanced vision-language model based on CLIP, designed to mitigate risks associated with NSFW (Not Safe For Work) content in AI applications.

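Because Safe-CLIP keeps the CLIP architecture, it should load through the standard Transformers CLIP classes. A minimal sketch, assuming the repository id `aimagelab/safeclip_vit-l_14` inferred from this listing:

```python
from transformers import CLIPModel, CLIPProcessor

# Inferred repository id; verify it against the actual model card before use.
MODEL_ID = "aimagelab/safeclip_vit-l_14"

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)
# From here on, usage is identical to any other CLIP checkpoint in Transformers.
```
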
**Nllb Siglip Mrl Large** (visheratin) · Image-to-Text · 297 downloads · 14 likes
NLLB-SigLIP-MRL is a multilingual vision-language model that combines the text encoder from NLLB and the image encoder from SigLIP, supporting 201 languages from Flores-200.

**Nllb Siglip Mrl Base** (visheratin) · Image-to-Text · 352 downloads · 9 likes
A multilingual vision-language model combining the NLLB text encoder and the SigLIP image encoder, supporting 201 languages and multiple embedding dimensions.

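The "multiple embedding dimensions" in the MRL variants refer to Matryoshka-style training, where a prefix of the full embedding vector is itself a usable lower-dimensional embedding. The sketch below shows the generic truncate-and-renormalize step; the concrete dimensions (768 and 256 here) are placeholder assumptions, not values taken from the model card.

```python
import numpy as np

def shrink_embedding(full_emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka embedding and renormalize."""
    small = full_emb[..., :dim]
    return small / np.linalg.norm(small, axis=-1, keepdims=True)

# Example: cut a 768-dimensional embedding down to 256 dimensions.
full = np.random.default_rng(0).normal(size=768)
print(shrink_embedding(full, 256).shape)  # (256,)
```
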
**Owlvit Tiny Non Contiguous Weight** (fxmarty) · MIT · Text-to-Image · Transformers · 337 downloads · 0 likes
OWL-ViT is an open-vocabulary object detection model based on a Vision Transformer, capable of detecting categories that are not present in the training data.

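The fxmarty checkpoint is a tiny testing artifact, but it exercises the same OWL-ViT detection API as the full-size models. A minimal sketch using the original `google/owlvit-base-patch32` weights, with free-form text queries acting as the open vocabulary:

```python
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg").convert("RGB")
queries = [["a traffic light", "a bicycle", "a fire hydrant"]]  # one query list per image

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into thresholded detections in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]

for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(f"{queries[0][label.item()]}: {score:.2f} at {box.tolist()}")
```
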
**Nllb Clip Base Siglip** (visheratin) · Text-to-Image · 478 downloads · 1 like
NLLB-CLIP-SigLIP is a multilingual vision-language model that combines the text encoder from NLLB and the image encoder from SigLIP, supporting 201 languages.

**Nllb Clip Large Siglip** (visheratin) · Text-to-Image · 384 downloads · 5 likes
NLLB-CLIP-SigLIP is a multilingual vision-language model that combines the text encoder of the NLLB model and the image encoder of the SigLIP model, supporting 201 languages.

**Metaclip L14 400m** (facebook) · Text-to-Image · Transformers · 325 downloads · 3 likes
MetaCLIP is a vision-language model trained on CommonCrawl data to construct a shared image-text embedding space.

**Metaclip L14 Fullcc2.5b** (facebook) · Text-to-Image · Transformers · 172 downloads · 3 likes
MetaCLIP is a large-scale vision-language model trained on 2.5 billion image-text pairs from CommonCrawl (CC), and the accompanying work reveals the data curation methodology behind CLIP.

**Metaclip B16 400m** (facebook) · Text-to-Image · Transformers · 51 downloads · 1 like
MetaCLIP is a vision-language model trained on CommonCrawl data to construct a shared image-text embedding space.

**Metaclip B32 Fullcc2.5b** (facebook) · Text-to-Image · Transformers · 413 downloads · 7 likes
MetaCLIP is a vision-language model trained on 2.5 billion image-text pairs from CommonCrawl (CC) to construct a shared image-text embedding space.

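The MetaCLIP checkpoints follow the standard CLIP interface in Transformers, so zero-shot classification against the shared image-text embedding space looks the same as for any CLIP model. A minimal sketch, assuming the `facebook/metaclip-b32-fullcc2.5b` repository id from this listing:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "facebook/metaclip-b32-fullcc2.5b"
model = AutoModel.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("example.jpg").convert("RGB")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each candidate caption.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```
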
**Nllb Clip Base Oc** (visheratin) · Text-to-Image · 371 downloads · 1 like
NLLB-CLIP is a multilingual vision-language model combining the NLLB text encoder with the CLIP image encoder, supporting 201 languages.

**Languagebind Audio** (LanguageBind) · MIT · Multimodal Alignment · Transformers · 271 downloads · 3 likes
LanguageBind is a language-centric multimodal pre-training method that extends video-language pre-training to N modalities through language semantic alignment, achieving high-performance multimodal understanding and alignment.

**CLIP ViT L 14 CommonPool.XL.clip S13b B90k** (laion) · MIT · Text-to-Image · 534 downloads · 1 like
A vision-language model based on the CLIP architecture, supporting zero-shot image classification and cross-modal retrieval.

**Altclip M18** (BAAI) · Text-to-Image · Transformers · 58 downloads · 5 likes
AltCLIP-m18 is a CLIP model supporting 18 languages for image-text matching tasks.

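Transformers ships dedicated AltCLIP classes, and since this entry carries the Transformers tag, the m18 checkpoint should load through them. The sketch below is written under that assumption rather than taken from the model card; the `BAAI/AltCLIP-m18` repository id is inferred from the listing.

```python
import torch
from PIL import Image
from transformers import AltCLIPModel, AltCLIPProcessor

# Assumes the m18 checkpoint exposes Transformers-compatible weights.
MODEL_ID = "BAAI/AltCLIP-m18"
model = AltCLIPModel.from_pretrained(MODEL_ID)
processor = AltCLIPProcessor.from_pretrained(MODEL_ID)

image = Image.open("example.jpg").convert("RGB")
texts = ["una foto de un gato", "一张狗的照片"]  # multilingual candidate captions

inputs = processor(text=texts, images=image, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))
```
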
**Clip Fa Vision** (SajjadAyoubi) · Text-to-Image · Transformers · 43 downloads · 5 likes
CLIPfa is the Persian version of OpenAI's CLIP model, connecting Persian text and image representations through contrastive learning.

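This entry is only the image tower; CLIPfa keeps its text and image encoders in separate repositories. A sketch of pairing the two is given below; the `SajjadAyoubi/clip-fa-text` repository id and the matching embedding width are assumptions based on the naming pattern, not facts from this listing.

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor, CLIPVisionModel, RobertaModel

# Image tower from this listing; the text-tower id follows the same naming
# pattern and should be confirmed against the author's model cards.
vision_encoder = CLIPVisionModel.from_pretrained("SajjadAyoubi/clip-fa-vision")
image_processor = CLIPImageProcessor.from_pretrained("SajjadAyoubi/clip-fa-vision")
text_encoder = RobertaModel.from_pretrained("SajjadAyoubi/clip-fa-text")
tokenizer = AutoTokenizer.from_pretrained("SajjadAyoubi/clip-fa-text")

image_inputs = image_processor(Image.open("example.jpg").convert("RGB"), return_tensors="pt")
text_inputs = tokenizer(["یک عکس از یک گربه"], return_tensors="pt")  # "a photo of a cat"

with torch.no_grad():
    image_emb = vision_encoder(**image_inputs).pooler_output
    text_emb = text_encoder(**text_inputs).pooler_output

# Cosine similarity ranks image-text pairs (assumes both towers share one embedding width).
print(torch.nn.functional.cosine_similarity(image_emb, text_emb))
```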